prometheus metrics for http webhook publishers#1149
Conversation
|
Hello @v1r3n please review this |
There was a problem hiding this comment.
Thanks for the submission! There are a few things to clean up first...
First, a bit of a consistency improvement to make here:
recordWebhookQueueDepth takes (notificationType, size) but the other three methods all carry a name parameter (workflow/task definition name). That means you can slice success and failure by name but not queue depth. Either add name to recordWebhookQueueDepth to match, or drop it from the others — whichever feels right for how you'd actually query these in Grafana.
nthmost-orkes
left a comment
There was a problem hiding this comment.
Quick question: can you double-check that webhook_enqueue_failure will actually fire in practice?
Both publishers use blockingQueue.put(), which blocks when the queue is full rather than throwing — so the only exception it can raise is InterruptedException (thread interrupted). If the queue fills up under load, put() will stall the caller but the failure metric will stay silent.
If the intent is to detect "queue full, item dropped," offer() is the right call — it returns false immediately when the queue is at capacity, giving you a clean place to record the metric and log a warning. Happy to be corrected if there's a different failure mode you had in mind.
nthmost-orkes
left a comment
There was a problem hiding this comment.
One more observation, non-blocking but worth doing right: the webhook_queue_depth gauge records a snapshot at enqueue time via AtomicDouble.set(). If the publisher thread drains items between enqueues, the gauge can read much higher than the actual depth at scrape time — you'd see a sawtooth that never fully reflects the live queue state.
A pull-style registration in the constructor gives you a live reading at every Prometheus scrape instead:
Gauge.builder("webhook_queue_depth", blockingQueue, Collection::size)
.tag("notificationType", NOTIFICATION_TYPE)
.register(Metrics.globalRegistry);Then no call needed at enqueue time at all. The existing event_queue_depth gauge in Monitors has the same limitation, so this PR is at least consistent — but since you're adding new infrastructure here it's a good opportunity to set a better pattern.
|
Hi @nthmost-orkes
|
Thanks for the update -- can you increase the test coverage on this feature? Then we should be good to go. The three counters (webhook_publish_success / _failure / _enqueue_failure) don't have tests yet — only the gauge does. The counter ones are cheap to add the same way the gauge test works: add a SimpleMeterRegistry probe, call the method, assert the tagged counter incremented (including the defaultIfBlank → "unknown" fallback). And TaskStatusPublisher has no test file at all today, so all of its new wiring is uncovered. |
Summary
Fixes #1073
Adds Prometheus/Micrometer metrics for HTTP webhook publishers (
workflow_publisherandtask_publisher), following the sameMonitorspattern used by archive listeners (e.g.recordWorkflowArchived).New metrics:
webhook_publish_success— HTTP 200/202 after publishwebhook_publish_failure— exception during publishwebhook_enqueue_failure— offer() returns false when buffer fullwebhook_queue_depth— in-memory notification queue size after enqueueTest plan
conductor.workflow-status-listener.type=workflow_publisherandconductor.task-status-listener.type=task_publisherwith a reachable webhook URL/actuator/prometheus:webhook_publish_success_totalwebhook_queue_depthwebhook_publish_failure_total